{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Using `partis` to generate synthetic datasets\n",
    "\n",
    "This notebook explains how to use the software [`partis`](https://github.com/psathyrella/partis) to generate synthetic datasets.\n",
    "\n",
    "Note that you will also need [Change-O](https://changeo.readthedocs.io/en/version-0.4.4/) toolkit,\n",
    "`geiger`, `ape`, `TreeSim` in R language."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import os\n",
    "import sys\n",
    "\n",
    "from subprocess import call, check_output, STDOUT, check_call\n",
    "\n",
    "from icing.externals.DbCore import parseAllele, gene_regex, junction_re\n",
    "from icing.utils import io\n",
    "\n",
    "# configuration\n",
    "partis_path = # PARTIS PATH \n",
    "output_path = # OUTPUT FOLDER \n",
    "makedb_path = # MakeDb.py PATH\n",
    "n_iter = 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# need to install geiger, ape, TreeSim in R for this to work\n",
    "for i in [1500, 3000, 5000]:\n",
    "    for j in [400]:\n",
    "        call('{1}/bin/partis simulate '\n",
    "             #'--parameter-dir {1}/test/reference-results/test/parameters/data/ '\n",
    "             '--simulate-partially-from-scratch '\n",
    "             '--outfname {3}/clones_{0}.{2}.csv --n-sim-events {0} --n-leaves {2} '\n",
    "             '--indel-frequency 0.05 --indel-location cdr3 --mean-indel-length 6 '\n",
    "             '--n-procs 16'.format(i, partis_path, j, output_path).split())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This produces into `output_path` folder a list of RAW sequences.\n",
    "\n",
    "To simplify the processing of IMGT/HighV-Quest, let's a unique `fasta` file where, in the `ID` string, there is also the identity of the original database name. This will allow us to recover our original databases splitted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "files = [os.path.join(output_path, x) for x in os.listdir(output_path) if x.endswith('.csv')]\n",
    "\n",
    "# 1. create a single pandas dataframe\n",
    "db_s = []\n",
    "for x in files:\n",
    "    df = pd.read_csv(x, index_col=None)\n",
    "    df['db'] = x.split('/')[-1]\n",
    "    db_s.append(df)\n",
    "\n",
    "df = pd.concat(db_s)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. create fasta file up to 500k sequences\n",
    "for i in range(df.shape[0] / 500000 + 1):\n",
    "    with open(os.path.join(output_path, \"all_{}.fasta\".format(i)), 'w') as f:\n",
    "        for index, row in (df.iloc[i * 500000:(i+1)*500000].iterrows()):\n",
    "            f.write(\">\" + \"_\".join([row['db']] + [str(a) for a in row.values[:-8]]))\n",
    "            f.write(\"\\n\")\n",
    "            f.write(row['seq'])\n",
    "            f.write(\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok! Now we can use IMGT to convert our `fasta` file(s) into databases which we can use as input to ICING.\n",
    "To do so, connect to IMGT HighV-Quest software and upload the data.\n",
    "\n",
    "When finished, an email will notify that results are ready. Now, download them and extract the \"txz\" files as folders to use them with Change-O `MakeDb` script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash -s \"$output_path\" \"$makedb_path\"\n",
    "# run Changeo to convert IMGT into fasta file\n",
    "# python MakeDb.py imgt -i <imgt output, zip or folder> -s <original fasta file> --scores\n",
    "for i in {0..12}\n",
    "  do \n",
    "     python $2/MakeDb.py imgt -i $1/imgt-pass/partis_6_$i -s $1/fasta/all_$i.fasta --scores\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Divide now the IMGT-ChangeO processed files into a final list of databases which are usable from our method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = os.path,join(output_path, 'makedb-pass')\n",
    "db_s = []\n",
    "for f in [os.path.join(path, x) for x in os.listdir(path) if x.endswith('db-pass.tab')]:\n",
    "    db_s.append(pd.read_csv(f, dialect='excel-tab'))\n",
    "\n",
    "df = pd.concat(db_s)\n",
    "    \n",
    "# add the mutation column\n",
    "df['MUT'] = (1 - df['V_IDENTITY']) * 100.\n",
    "\n",
    "df['DB'] = df['SEQUENCE_ID'].str.split('.csv').apply(lambda x: min(x, key=len))\n",
    "for i in df.DB.unique():\n",
    "    df[df.DB == i].to_csv(os.path.join(path, str(i) + '.tab'), index=False, sep='\\t')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's produce an overview of the datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = os.path,join(output_path, 'datasets')\n",
    "\n",
    "df_all = pd.DataFrame()\n",
    "for f in sorted([os.path.join(path, x) for x in os.listdir(path) if x.startswith('clones_')]):\n",
    "    df = io.load_dataframe(f)\n",
    "    \n",
    "    df['true_clone'] = [x[3] for x in df.sequence_id.str.split('_')] \n",
    "    row = {}\n",
    "    row['n_seqs'] = int(df.shape[0])\n",
    "    \n",
    "    df['true_v'] = [parseAllele(x[4], gene_regex, 'first') for x in df.sequence_id.str.split('_')] \n",
    "    df = df.loc[['OR' not in x for x in df.true_v], :]\n",
    "        \n",
    "    df['true_d'] = [parseAllele(x[5], gene_regex, 'first') for x in df.sequence_id.str.split('_')] \n",
    "    df = df.loc[['OR' not in x for x in df.true_d], :]\n",
    "    \n",
    "    df['true_j'] = [parseAllele(x[6], gene_regex, 'first') for x in df.sequence_id.str.split('_')] \n",
    "\n",
    "    row['unique V genes'] = int(df.true_v.unique().size)\n",
    "    row['unique D genes'] = int(df.true_d.unique().size)\n",
    "    row['unique J genes'] = int(df.true_j.unique().size)\n",
    "    \n",
    "    row['unique functional V genes'] = len([x for x in set(df.true_v) if 'OR' not in x])\n",
    "    row['unique functional D genes'] = len([x for x in set(df.true_d) if 'OR' not in x])\n",
    "    row['unique functional J genes'] = len([x for x in set(df.true_j) if 'OR' not in x])\n",
    "    \n",
    "    row['database'] = f.split('/')[-1]\n",
    "    row['clonotypes'] = int(df.true_clone.unique().size)\n",
    "    row['avg seqs/clone'] = np.mean([len(x) for x in df.groupby('true_clone').groups.values()])\n",
    "        \n",
    "    row['mean (std) of V gene mutation'] = \"%.2f (%.2f)\" % (df.mut.mean(), df.mut.std())\n",
    "    df_all = df_all.append(row, ignore_index=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_all['indexNumber'] = [int(i.split('_')[-1].split('.')[0]) + int(\n",
    "    i.split('_')[-1].split('.')[1]) for i in df_all.database]\n",
    "# Perform sort of the rows\n",
    "df_all.sort_values(['indexNumber'], ascending = [True], inplace = True)\n",
    "# Deletion of the added column\n",
    "df_all.drop('indexNumber', 1, inplace = True)\n",
    "\n",
    "df_all['avg seqs/clone'] = df_all['avg seqs/clone'].map('{:.2f}'.format)\n",
    "\n",
    "df_all[['n_seqs', 'clonotypes', 'unique V genes', 'unique D genes', 'unique J genes',\n",
    "       'unique functional V genes','unique functional D genes','unique functional J genes']] = df_all[\n",
    "    ['n_seqs', 'clonotypes', 'unique V genes', 'unique D genes', 'unique J genes',\n",
    "     'unique functional V genes','unique functional D genes','unique functional J genes']].astype(int)\n",
    "\n",
    "sorted_df = df_all.loc[:, ['database', 'n_seqs', 'clonotypes', 'avg seqs/clone', 'unique V genes',\n",
    "               'unique D genes', 'unique J genes',\n",
    "                           'mean (std) of V gene mutation']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the results in the current folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted_df.to_latex(\"dataset_table.tex\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "Other useful visualisations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = io.load_dataframe(sorted([os.path.join(path, x) for x in os.listdir(path) if x.startswith('clones_')])[-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['true_v'] = [parseAllele(x[4], gene_regex, 'first') for x in df.sequence_id.str.split('_')] \n",
    "df['true_d'] = [parseAllele(x[5], gene_regex, 'first') for x in df.sequence_id.str.split('_')] \n",
    "df['true_j'] = [parseAllele(x[6], gene_regex, 'first') for x in df.sequence_id.str.split('_')] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set of V genes\n",
    "set(df.true_v)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sorted_df['unique V genes'] = ['%d (%d)' % (a,b) for a, b in zip(sorted_df['unique V genes'],\n",
    "                                                        sorted_df['unique functional V genes'])]\n",
    "sorted_df['unique D genes'] = ['%d (%d)' % (a,b) for a, b in zip(sorted_df['unique D genes'],\n",
    "                                                        sorted_df['unique functional D genes'])]\n",
    "sorted_df['unique J genes'] = ['%d (%d)' % (a,b) for a, b in zip(sorted_df['unique J genes'],\n",
    "                                                        sorted_df['unique functional J genes'])]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "del sorted_df['unique functional V genes']\n",
    "del sorted_df['unique functional D genes']\n",
    "del sorted_df['unique functional J genes']\n",
    "del sorted_df['prova']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[['true_v', 'v_call']].iloc[0:72]\n",
    "df[df['true_v'] == 'IGHV3-11', ['']]"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
